In the following pages we wish to discuss briefly floating point
computation as performed on a typical digital computer. Our objective
will be two-fold: to illustrate the peculiarities of arithmetic in such
an environment caused by the imprecise representation of the real number
system, and to indicate how various choices in representation and
arithmetic algorithms impinge on mathematical software.


1.  Basic  Issues 

As usual, we assume a positional or polynomial representation of
whole numbers, i.e. for a given value $x$ we have

    $x = p_n(\beta) = \sum_{i=0}^{n-1} d_i \beta^i$    (1)

where $\beta$ is the base or radix of the representation of $x$, and $n$ the
number of base $\beta$ digits $d_0, d_1, \ldots, d_{n-1}$ with
$d_i \in \{0, 1, \ldots, \beta - 1\}$. We often express the base $\beta$
representation of $x$ in (1) as

    $x = (d_{n-1} d_{n-2} \cdots d_2 d_1 d_0)_\beta$    (2)

As usual a sign is prefixed to (1) to extend the representation to the
integers. Note that for fixed $n$, it is obvious that (1) or (2) can
represent only values in the range

    $0 \le x \le \beta^n - 1$    (3)

(or, with a sign, $-\beta^n + 1 \le x \le \beta^n - 1$). To extend (1) to a
range of the rationals, negative indices or exponents are permitted:

    $x = p_{n,m}(\beta) = \sum_{i=-m}^{n-1} d_i \beta^i$    (4)

yielding $n+m$ digit base $\beta$ rationals, i.e., we can rewrite (4) as

    $x = (d_{n-1} d_{n-2} \cdots d_2 d_1 d_0 \,.\, d_{-1} d_{-2} \cdots d_{-m})_\beta$    (5)




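Example 1.

    $p_{4,4}(10) = 1 \cdot 10^3 + 2 \cdot 10^2 + 3 \cdot 10^1 + 4 \cdot 10^0 + 5 \cdot 10^{-1} + 6 \cdot 10^{-2} + 7 \cdot 10^{-3} + 8 \cdot 10^{-4} = (1234.5678)_{10}$

    $p_{3,3}(2) = 1 \cdot 2^2 + 1 \cdot 2^1 + 0 \cdot 2^0 + 1 \cdot 2^{-1} + 1 \cdot 2^{-2} + 1 \cdot 2^{-3} = (110.111)_2 = (6.875)_{10}$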
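The positional expansion used in Example 1 can be checked mechanically. The following is a minimal sketch, not part of the original text; the function name `positional_value` and its argument layout are illustrative assumptions, and exact rational arithmetic is used so that no rounding enters the check.

```python
from fractions import Fraction

def positional_value(int_digits, frac_digits, base):
    """Evaluate (d_{n-1} ... d_0 . d_{-1} ... d_{-m}) in the given base, exactly.

    int_digits lists d_{n-1} .. d_0 and frac_digits lists d_{-1} .. d_{-m}
    (hypothetical layout chosen for this sketch).
    """
    value = Fraction(0)
    n = len(int_digits)
    for i, d in enumerate(int_digits):
        # the digit at list position i carries weight base**(n-1-i)
        value += d * Fraction(base) ** (n - 1 - i)
    for j, d in enumerate(frac_digits, start=1):
        # the j-th digit after the radix point carries weight base**(-j)
        value += d * Fraction(base) ** (-j)
    return value

decimal_example = positional_value([1, 2, 3, 4], [5, 6, 7, 8], 10)
binary_example = positional_value([1, 1, 0], [1, 1, 1], 2)
print(float(decimal_example), float(binary_example))
```

Both results agree with the digit strings of Example 1 when compared as exact fractions.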


Note that in (4) or (5), if the number of base $\beta$ digits to the right
of the radix point is fixed a priori, regardless of the number we wish to
represent, we have a fixed point notation or number system which is limited
to representing magnitudes in the range

    $0 \le x \le \beta^n - 1$

as before, but now with the smallest non-zero magnitude representable
being $\beta^{-m}$. With the addition of a scale factor, $s$, the range of
representable numbers can be expressed as

    $-\beta^s (\beta^n - 1) \le x \le \beta^s (\beta^n - 1)$

The range of useful values in our system of representation can be
greatly increased if we use a scheme similar to "scientific notation" for
physical constants, i.e. quantities are represented in terms of a signed
fraction or mantissa and a signed exponent or characteristic. Thus we
have now to specify the number of digits, $e$, and the base, $\beta_e$, of the
exponent and the number of digits, $m$, and the base, $\beta_m$, of the mantissa.
Thus we have a two part representation $(f, c)$ where $f$, the fraction, is a
fixed point number with base $\beta_m$, $m$ digits and scale factor $s$, and $c$,
the exponent, is a base $\beta_e$, $e$ digit fixed point number with scale factor
$s_e$ (usually $s_e = 0$, hence the exponent is an integer). In addition, the
exponent may be biased, i.e. if the exponent lies in the range
$-M_1 \le$ exponent $\le M_2$ then, to avoid the explicit sign, $M_1$ may be
added to all exponents, hence an excess-$M_1$ notation. If the scale factor,
$s$, is zero, then the radix point is at the right of the mantissa (e.g.
Burroughs, CDC hardware), while a factor of $s = -m$ yields an implied point
at the left hand side (e.g. the IBM S/360, S/370), so that $|f| < 1$; hence
$\beta_m^m f$ is an integer and

    $-\beta_m^m < \beta_m^m f < \beta_m^m$


Note that up to this point we have distinguished between $\beta_e$, the
exponent base, and $\beta_m$, the mantissa base. Almost without exception these
are powers of the hardware base, $\beta_h$, which is usually 2 (e.g. for the IBM
S/360-S/370, $\beta_e = 2$, $\beta_m = 16$). To avoid fractional exponents, we
refer to $\beta$, the floating point base/radix, and assume $\beta = \beta_m$.
Clearly the largest $e$ digit exponent then is $\beta_e^e - 1$. Hence the largest
representable number is equal to

    (the largest mantissa) $\times \beta^{\beta_e^e - 1}$    (7)

Assuming $\beta = \beta_h^k$, the right hand term becomes $\beta_h^{k(\beta_e^e - 1)}$;
we define the exponent range to be $k(\beta_e^e - 1)$. With respect to the
mantissa in (7), if the scale factor, $s$, is zero, then the largest mantissa
is $\beta^m - 1$, the smallest, 1. If $s = -m$ on the other hand, then the
largest mantissa is

    $\beta^{-m} (\beta^m - 1) = 1 - \beta^{-m} \approx 1$

and the smallest is $\beta^{-m}$.

Thus, assuming an exponent scale factor of 0, we can denote our
floating point number system by

    $FL(\beta_e, \beta, m, e, s)$    (9)

where $\beta_e$ is the base of the exponent, $\beta$ the base (of the mantissa),
$m$ the number of digits in the mantissa, $e$ the number of digits in the
exponent and $s$ the (mantissa) scale factor.

If $s = -m$, then we can simplify (9) to

    $FL(\beta_e, \beta, m, e)$    (10)

With (10) the representable values of the system, denoted
$S(\beta_e, \beta, m, e)$, are given by

    $\beta^{1-m-\beta_e^e} \le x \le (1 - \beta^{-m}) \beta^{\beta_e^e - 1}$,  $x = 0$,
    or  $-(1 - \beta^{-m}) \beta^{\beta_e^e - 1} \le x \le -\beta^{1-m-\beta_e^e}$    (11)

If $\beta_e = \beta = 2$, then $S(2, 2, m, e)$ becomes

    $2^{1-m-2^e} \le x \le (1 - 2^{-m}) 2^{2^e - 1}$,  $x = 0$,
    or  $-(1 - 2^{-m}) 2^{2^e - 1} \le x \le -2^{1-m-2^e}$    (12)


The precision of the set $S(\beta_e, \beta, m, e)$ is defined to be the number
of digits representable in the mantissa, normally in base $\beta$ digits, i.e.
$\{0, 1, \ldots, \beta - 1\}$; for purposes of comparison we speak of the binary
precision, the number of bits (base 2 digits) representable in the
mantissa. The range of $FL(\beta_e, \beta, m, e, s)$ is defined to be the largest
representable magnitude, hence

    $\beta^s (\beta^m - 1) \beta^{\beta_e^e - 1}$    (13)

or, with $\beta_e = \beta = 2$ and fractional mantissae,
range $FL(2, 2, m, e) = (1 - 2^{-m}) 2^{2^e - 1}$.
Example 2.

For some typical hardware families we have

Machine               Word Size      β       Exponent            Mantissa                           Representation
                      (bits)                 (bits)              (bits)
Burroughs 6700/7700   47 plus tag    8       7 (S-M)             39 (integer)                       S-M
CDC 6600/Cyber 70     60             2       11 (Excess 2^10)    48 (integer)                       1's C
DEC PDP-11            32             2       8 (Excess 2^7)      23 (fraction)                      S-M
Honeywell H8200       48             2, 10   7 (Excess 2^6)      40 (Binary), 10 (BCD) (fraction)   S-M
IBM S/370             32             16      7 (Excess 2^6)      24 (fraction)                      S-M


Thus far, we have ignored the question of uniqueness; clearly
for a given real value $x$ we wish to have a representative within
$S(\beta_e, \beta, m, e, s)$ which best approximates $x$ in some sense. However,
since $f \beta^c = (\beta^a f) \beta^{c-a}$ for any integer $a$, we choose a
normalized representative, i.e. one such that the most significant digit of
the mantissa is non-zero, with the exponent adjusted accordingly by a
suitable choice of $a$. The corresponding Normalized Floating Point Number
System is denoted $NFL(\beta_e, \beta, m, e, s)$, or if fractional mantissae are
assumed, $NFL(\beta_e, \beta, m, e)$ (hence $s = -m$). Thus the smallest
representable (non-zero) mantissa magnitudes are respectively

    $\beta^{m-1}$  ($s = 0$)  and  $\beta^{-m} (\beta^{m-1}) = \beta^{-1}$
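The normalization step just described, shifting leading zero digits out of a fractional mantissa and compensating in the exponent, can be sketched as follows. The function name `normalize` and the digit-list layout are assumptions made for illustration only; the mantissa is taken to be a fraction 0.d_1 d_2 ... d_m, so each left shift multiplies it by the base and the exponent is decreased by one to keep the value unchanged.

```python
def normalize(mantissa_digits, exponent):
    """Return an equivalent (digits, exponent) pair whose leading digit is non-zero.

    mantissa_digits holds d_1 .. d_m of the fraction 0.d_1 d_2 ... d_m
    (hypothetical layout for this sketch).
    """
    digits = list(mantissa_digits)
    if not any(digits):
        # zero has no normalized representative; return it unchanged
        return digits, exponent
    while digits[0] == 0:
        # shift left by one digit position and compensate in the exponent
        digits = digits[1:] + [0]
        exponent -= 1
    return digits, exponent

print(normalize([0, 0, 1, 1], 5))
```

For the sample input the two leading zeros are shifted out and the exponent drops from 5 to 3, leaving the represented value the same.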


Example 3.

For a binary system ($\beta = \beta_e = 2$) with $m$ bits for the mantissae,
$e$ for the exponent, assuming $s = -m$, then if we denote the set of values
in $NFL(2, 2, m, e)$ by $NS(2, 2, m, e)$, $x \in NS(2, 2, m, e)$ if

    $2^{-2^e} \le x \le (1 - 2^{-m}) 2^{2^e - 1}$,  $x = 0$,
    or  $-(1 - 2^{-m}) 2^{2^e - 1} \le x \le -2^{-2^e}$.
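For small parameters the bounds of Example 3 can be confirmed by brute-force enumeration. The sketch below is illustrative only; it assumes that exponents run symmetrically over -(2^e - 1), ..., 2^e - 1, which is the range consistent with the stated bounds, and the name `normalized_values` is hypothetical.

```python
from fractions import Fraction

def normalized_values(m, e):
    """Positive members of a toy NS(2, 2, m, e) with fractional mantissae.

    Assumption: exponents range over -(2**e - 1) .. 2**e - 1.
    """
    emax = 2 ** e - 1
    values = set()
    for c in range(-emax, emax + 1):
        # normalized mantissa k / 2**m has its leading bit set
        for k in range(2 ** (m - 1), 2 ** m):
            values.add(Fraction(k, 2 ** m) * Fraction(2) ** c)
    return sorted(values)

m, e = 3, 2
vals = normalized_values(m, e)
smallest, largest = vals[0], vals[-1]
print(smallest, largest, len(vals))
```

With m = 3 and e = 2 the smallest and largest positive values coincide with the two bounds in the example, and every (mantissa, exponent) pair yields a distinct value, as normalization is meant to guarantee.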

2.  Relations  In  The  Parameters 


In the preceding sections we discussed the basic normalized
floating point representation, with the tacit assumption that the number
of base $\beta$ digits available for the mantissa and the number of exponent
(base $\beta_e$) digits are determined by hardware considerations. Here we wish
to examine the choices available or implied for these parameters and the
tradeoffs between allocations of resources to one versus another of them.
First, let us assume the mantissa size, $m$, is fixed, and that exponents
are positive. Then the representation ratio is the ratio of the number of
values representable using a higher base to the number of binary values
representable, both counted within the range of binary numbers
representable with the available number of bits.
For example, for $\beta = 2$, in a normalized floating point system only
half of the possible representable values are used, for a given mantissa
size, i.e. those with the most significant bit = 1. On the other hand,
when a leading 0 is permitted ($\beta = 4$), 50% more values are representable.
Similar increases occur for higher values of $\beta$.

Now consider $NFL(\beta_e, \beta, m, e)$ (hence $s = -m$); there are $2^e$
different exponents and $2^{m-1}$ different normalized mantissae representable,
if $\beta = \beta_e = 2$. Thus the total number of positive representable values
with positive exponents is $2^{e+m-1}$; since the largest representable
mantissa is $\approx 1$ and the largest representable exponent is $2^e - 1$, the
largest representable binary number is $\approx 2^{2^e - 1}$. If we contrast this
with the case $\beta = 2^k$, with numbers of the form $M \cdot \beta^p$, and
estimate the number of values less than the largest representable binary
number ($2^{2^e - 1}$), assume $M \approx 1$ and choose $p$ such that
$\beta^p \approx 2^{2^e - 1}$, so that $p \log_2 \beta \approx 2^e - 1$. Thus, the
number of representable values between $\beta^{-1}$ and $\beta^p$ is approximately

    $(2^{m-1} + 2^{m-2} + \cdots + 2^{m - \log_2 \beta}) (p+1) = 2^m (1 - \beta^{-1}) (p+1)$

where $m$ is the number of binary digits, not base $\beta$ digits. To compute
the representation ratio, we compare the total number of base $\beta$ values in
the range $[2^{-1}, 2^{2^e - 1}]$ to the total number of binary values in that
range, i.e.

    Representation Ratio $= \dfrac{\text{Base } \beta \text{ values in } [2^{-1}, 2^{2^e - 1}]}{\text{Binary values in } [2^{-1}, 2^{2^e - 1}]}$
Example 4.

Let $\beta = 16$, $e = 8$; then

    $p = \dfrac{2^e - 1}{\log_2 \beta} \approx 64$

    Representation Ratio $= \dfrac{2 (1 - \beta^{-1}) (p+1)}{1 + p \log_2 \beta} = \dfrac{2 (1 - 16^{-1}) (1 + 2^6)}{1 + 2^6 \cdot 4} \approx 0.475$


It is easy to see that for fixed $e$, $m$ there are about 1.875 times as
many base 16 values representable as base 2 values. Hence, for positive
exponents, 1/4 ($\approx 0.475/1.875$) of the hexadecimal values are in the range
of the binary values, and 3/4 outside, for fixed $e$, $m$. Note that although,
as we have seen, more numbers are representable over a wider range by using
$\beta = 2^k$, $k > 1$, on the interval where the base $\beta$ and binary values
overlap the binary numbers are much more densely distributed, and hence do
a better job at representing that interval of the real numbers.
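The arithmetic of Example 4 can be reproduced with a short calculation. This is a sketch under the stated approximations (mantissa close to 1, positive exponents only); the function name `representation_ratio` is an assumption for illustration, and here p is left unrounded rather than taken as 64.

```python
import math

def representation_ratio(beta, e):
    """Estimate 2(1 - 1/beta)(p + 1) / (1 + p*log2(beta)) with p = (2**e - 1)/log2(beta)."""
    k = math.log2(beta)          # bits per base-beta digit
    p = (2 ** e - 1) / k         # largest base-beta exponent inside the binary range
    return 2 * (1 - 1 / beta) * (p + 1) / (1 + p * k)

ratio = representation_ratio(16, 8)
density_factor = 2 * (1 - 1 / 16)   # base-16 values per binary value, fixed e and m
print(round(ratio, 3), density_factor, round(ratio / density_factor, 3))
```

The ratio comes out near 0.474, in line with the rounded figure quoted in the example, and dividing by the factor 1.875 gives roughly one quarter, the fraction of hexadecimal values that fall inside the binary range.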

A closely related issue to consider is the choice of $\beta$ and the
size of $e$: assume the word size $n = m + e$ is fixed, and consider the
tradeoff between a choice for $e$ (or $m$) and the choice of $\beta$. Obviously
one criterion is the effect on the range of representable numbers. Let the
exponent range be $k(\beta_e^e - 1)$ for $\beta = 2^k$ and consider the worst-case
analysis of the accuracy of representation. Clearly if $\beta = 2^k$, accuracy
will decrease automatically with $k$, since $m - k + 1$ bits are used in the
worst case.

Let $\hat{x}$ denote the floating point representation of $x$ in
$FL(\beta_e, \beta, m, e, s)$. Then, the absolute representation error is given by
$|\hat{x} - x|$, and the relative representation error, $\delta(x)$, by

    $\delta(x) = \dfrac{\hat{x} - x}{x}$

hence

    $\hat{x} = x (1 + \delta(x))$

An upper bound on $\delta(x)$ is given by the smallest $\epsilon$ such that
$|\delta(x)| \le \epsilon$ for all $x \in FL(\beta_e, \beta, m, e, s)$; we refer to
$\epsilon$ as the Maximum Relative Representation Error (MRRE) for the machine
representation of $x$.
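To make the relative representation error concrete, the following sketch chops a positive rational to m significant base-β digits and evaluates δ(x) exactly. Chopping (truncation) is an assumption made here for illustration, since the text does not fix a rounding rule; under chopping with a fractional normalized mantissa, |δ(x)| stays below β^(1-m). The names `chop` and `relative_error` are hypothetical.

```python
import math
from fractions import Fraction

def chop(x, beta, m):
    """Chop positive rational x to m base-beta digits, mantissa in [1/beta, 1)."""
    x = Fraction(x)
    c = 0
    # locate the exponent c with beta**(c-1) <= x < beta**c
    while x >= Fraction(beta) ** c:
        c += 1
    while x < Fraction(beta) ** (c - 1):
        c -= 1
    scaled = x / Fraction(beta) ** c           # normalized fractional mantissa
    kept = math.floor(scaled * beta ** m)      # keep m digits, discard the rest
    return Fraction(kept, beta ** m) * Fraction(beta) ** c

def relative_error(x, beta, m):
    """delta(x) = (xhat - x) / x for the chopped representation xhat."""
    x = Fraction(x)
    return (chop(x, beta, m) - x) / x

delta_bin = relative_error(Fraction(1, 10), 2, 8)
delta_hex = relative_error(Fraction(1, 10), 16, 2)
print(delta_bin, delta_hex)
```

Both runs use 8 bits of mantissa, yet the hexadecimal error for 1/10 is six times larger in magnitude than the binary one, which illustrates the loss of worst-case accuracy as k grows.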
and,  since  3


1-k 


15  k,  Exponent  Range  (NFL2> 
Exponent  Range  (NFL^) 


3  k-1  . 

v  7,  1 


K 


If  k  *  3  =  2,  then  the  exponent  ranges  are  almost  equal,  but  otherwise, 
for a fixed MRRE, the choice $\beta = 2$ clearly yields a larger range of
exponents, indicating that for these criteria, we should choose $\beta = 2$.

With respect to numerical computations, additional parameter
relations must hold. Clearly, for $NFL(\beta_e, \beta, m, e)$, the maximum relative
spacing

    $\epsilon = \beta^{1-m}$    (18)

is critical. For simplicity below, let us denote the smallest positive
number in our system by

    $\sigma = \beta^{e_{min} - 1}$  ($s = -m$)    (19)

and the largest representable number by

    $\lambda = \beta^{e_{max}} (1 - \beta^{-m})$  ($s = -m$)    (20)

Clearly we assume $\beta \ge 2$, $e_{min} \le e_{max}$ above. To ensure that 1 is
included we require $e_{min} \le 1 \le e_{max}$, and to ensure that $\epsilon < 1$,
we must have $m \ge 2$.

These minimal requirements guarantee a meaningful floating point system,
but to produce numerical software with provable mathematical properties,
we need a system that is both reasonably large and well balanced.


Specifically, to provide a usable range for any given precision, we
require that

    $\sigma < \epsilon^2$    (21)

and

    $\lambda > \epsilon^{-2}$    (22)


(For example, in Brown's algorithm for the mean of a vector, we need
$\sigma < \epsilon^4 \lambda$ to avoid overflow when accumulating scaled small
components; in Lawson's algorithm for the Euclidean norm of a vector, we
must have $\lambda > \epsilon^{-3/2}$ to avoid overflow when summing the squares
of small components and $\sigma < \epsilon^2$ to avoid underflow. Thus (21) is
essential for Lawson's algorithm, while (22) provides a safety factor.)
Assuming the usefulness of (21), (22), their realism needs to be
considered. Clearly for (21), (22) to fail we would have a small range and
relatively high precision. For convenience, we restate (21), (22) as

    $e_{min} \le 2 - 2m$    (23)

    $e_{max} \ge 2m - 1$    (24)

Example  5. 

An extreme example of high precision ($\beta = 2$) with only 8 bits
allocated to the signed exponent is provided by the DEC PDP-10 and
Honeywell 6000 series; in double precision 64 bits are allocated to the
(signed) mantissae from a word size of 72 bits. These machines satisfy
(23), (24) by a very small margin. Assuming word size $n = 72$ with $e = 8$,
$m = 64$ we can use an implicit normalization to yield a precision of 64.
Likely exponent ranges (setting one value aside for zero) would be
[-127, 127] or [-126, 128]. The proposed inequalities then are the tightest
that would allow both of these possibilities. In addition to (21), (22) we
would prefer

    $\sigma \lambda \approx 1$    (25)

(cf. Reinsch) but (25) is neither essential nor realistic; instead we
require the weaker

    $\sigma \epsilon^{-1} < (\sigma \lambda)^{-1} < \epsilon \lambda$    (26)

which may be written as

    $\sigma^2 \lambda < \epsilon$    (27)

and

    $\sigma \lambda^2 > \epsilon^{-1}$    (28)

(Note that (21), (22) imply $\sigma \epsilon^{-2} < \sigma \lambda < \epsilon^2 \lambda$.
In Lawson's algorithm previously cited, we must have
$\sigma^2 \lambda < \epsilon^{1/2}$ and $\sigma \lambda^2 > 1$ to ensure that the
scale factors are within range, hence (27) provides a modest safety factor
and (28) a larger one. Once (27) is accepted, symmetry suggests (28).
Furthermore, (27) implies (28) in practice, since $\sigma \lambda \approx 1$ if the
implicit radix point is at the left while $\sigma \lambda > 1$ otherwise.)


(27), (28) may be re-written as

    $2 e_{min} + e_{max} \le 3 - m$    (29)

    $e_{min} + 2 e_{max} \ge m + 1$    (30)

If $e_{min} + e_{max} \approx 0$, then (29), (30) follow from (23), (24). However,
if $e_{min} + e_{max} \approx 2m$ (e.g. with radix point on the right), then by
(29), $e_{min} \le 3 - 3m$, hence $e_{max} \approx 2m - e_{min} \ge 5m - 3$ and
$e_{max} - e_{min} \ge 8m - 6$, exactly twice the exponent range $4m - 3$
specified by (23), (24). However, most existing hardware with the implied
radix point on the right has relatively


large  exponent  range  and  there  appears  to  be  no  floating  point  system  with 


o  E  el  failing  to  satisfy  (26). 
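The balance conditions can be checked numerically for any candidate parameter set. The sketch below is illustrative only: it assumes fractional mantissae ($s = -m$), takes σ, λ and ε as β^(e_min - 1), β^(e_max)(1 - β^(-m)) and β^(1-m) respectively, and uses exact rational arithmetic; the function names are hypothetical. It is applied to the two exponent ranges considered in Example 5.

```python
from fractions import Fraction

def model_parameters(beta, m, emin, emax):
    """Return (sigma, lambda, epsilon) for fractional mantissae (assumed s = -m)."""
    sigma = Fraction(beta) ** (emin - 1)
    lam = Fraction(beta) ** emax * (1 - Fraction(beta) ** (-m))
    eps = Fraction(beta) ** (1 - m)
    return sigma, lam, eps

def balance_checks(beta, m, emin, emax):
    """Evaluate the numbered conditions, keyed by equation number."""
    sigma, lam, eps = model_parameters(beta, m, emin, emax)
    return {
        "21": sigma < eps ** 2,
        "22": lam > eps ** -2,
        "23": emin <= 2 - 2 * m,
        "24": emax >= 2 * m - 1,
        "27": sigma ** 2 * lam < eps,
        "28": sigma * lam ** 2 > 1 / eps,
        "29": 2 * emin + emax <= 3 - m,
        "30": emin + 2 * emax >= m + 1,
    }

symmetric = balance_checks(2, 64, -127, 127)
shifted = balance_checks(2, 64, -126, 128)
too_narrow = balance_checks(2, 64, -100, 100)
print(all(symmetric.values()), all(shifted.values()), too_narrow["21"], too_narrow["23"])
```

Both exponent ranges from Example 5 pass every condition, while a narrower hypothetical range of [-100, 100] at the same precision fails (21) and its exponent form (23) together, as expected from their equivalence.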


